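Global Translation Affairs (环球•译事) | NYU Researcher: Why Has Neural Machine Translation Gone Mainstream?
As a think-tank-style research and news platform, 译世界 (YEEWORLD; official WeChat account 译•世界) is launching the "Global Translation Affairs" column. It focuses on the global language services industry and, through bilingual translations, original features, and other formats, comments on industry trends and tracks how the business is developing from a professional, forward-looking perspective. Follow us!
This first issue looks at why neural machine translation has become the mainstream approach to machine translation.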
Neural machine translation (NMT) is now mainstream. This was New York University Assistant Professor Kyunghyun Cho’s first message during his presentation on NMT at the recent SlatorCon New York on October 12, 2017.
如今,神经网络机器翻译(NMT)已经成为主流。这是纽约大学助理教授Kyunghyun Cho于2017年10月12日在纽约举办的SlatorCon论坛上介绍NMT时传达的第一个信息。
When Cho’s team started looking into NMT in 2013 and 2014, he said previous MT researchers and industry insiders were convinced it would not work. Efforts in the 1980s and mid-1990s failed, after all.
当Cho的团队在2013年和2014年开始研究NMT时,他说,之前的机器翻译研究人员和业内人士都深信这是行不通的。毕竟,20世纪80年代和90年代中期的努力都失败了。
Fast forward to 2017, Cho pointed out that big names like Google, Microsoft, and Facebook use NMT, and sites like Booking.com and even the European Patent Office have all caught the NMT bug.
时间快进到2017年,Cho指出谷歌、微软和Facebook等大公司都在使用NMT,Booking.com等网站,甚至欧洲专利局,也都赶上了NMT的热潮。
“So it’s mainstream,” Cho concluded. He added though, that research was continuous and ongoing, despite existing NMT systems outperforming statistical models that have been in place and aided by improvements for over ten years.
“所以,NMT已成为主流。” Cho总结道。他还补充说,尽管现有的NMT系统已经超越了历经10多年改进的统计模型,但关于NMT的研究仍在继续。
“Somehow Nobody has Tried It”
“不知为何,没人试过”
The key difference lies in how Cho and fellow researchers approached the problem. “So far a lot of the research on machine translation has been focused on sub-word level translation,” Cho said. “That is looking at a sentence as a sequence of sub-words.”
关键区别在于Cho及其同事处理这个问题的方式。“目前为止,很多机器翻译的研究都聚焦于子词级翻译,”Cho 表示,“也就是把句子看作一个子词序列。”
Cho and his co-researchers decided to go down to character-level modelling.
Cho 和他的合作研究员决定建立字符级模型。
“In 2016 we decided to try it out; somehow nobody has tried it,” he said. “When a new technology comes in, what everyone tries to do is use the new technology to mimic what you were able to do with the old technology. So everyone was stuck with morpheme-level or word-level modelling and then somehow forgot to try this new technology on new ways of representing a sentence, that is view it as a sequence of characters.” And the results were telling.
“2016年,我们决定试试看,不知道为什么,之前没有人尝试。”Cho 说,“当一项新技术出现时,每个人都试图使用新技术来模仿旧技术所能做的事情。所以大家都局限于语素级或字词级的模型中,而忘记将新技术用作处理句子的新方法,即将句子看作一序列字符。”结果说明了一切。
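To make the contrast concrete, here is a minimal sketch of what "viewing a sentence as a sequence of characters" means compared with word-level tokens. It is not Cho's model; the toy training sentences, the misspelled input, and the helper names (`word_tokens`, `char_tokens`, `unknown_rate`) are assumptions made up for illustration. It only shows why a character vocabulary is far less likely to meet an unknown symbol when the input contains spelling mistakes.

```python
def word_tokens(sentence: str) -> list[str]:
    """Word-level view: the sentence is a sequence of whitespace-separated words."""
    return sentence.split()


def char_tokens(sentence: str) -> list[str]:
    """Character-level view: the sentence is a sequence of characters (spaces included)."""
    return list(sentence)


def unknown_rate(tokens: list[str], vocab: set[str]) -> float:
    """Fraction of tokens that never appeared in the training vocabulary."""
    if not tokens:
        return 0.0
    unknown = sum(1 for token in tokens if token not in vocab)
    return unknown / len(tokens)


# Hypothetical toy "training corpus" (illustration only).
training = [
    "the system translates long sentences",
    "the model reads every sentence",
]
word_vocab = {w for s in training for w in word_tokens(s)}
char_vocab = {c for s in training for c in char_tokens(s)}

# Hypothetical input with three spelling mistakes.
noisy = "the modle translaets every sentense"

print("word-level unknown rate:", unknown_rate(word_tokens(noisy), word_vocab))
print("char-level unknown rate:", unknown_rate(char_tokens(noisy), char_vocab))
```

At the word level every misspelled word is out of vocabulary, while at the character level the same input is built entirely from symbols the model has already seen. The price is a much longer sequence for the model to process.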
Record Breaking
打破纪录
“This model beats any single paired model you can think of,” Cho said, reporting that the NMT system performed on par with, and often better than, existing MT models when assessed through BLEU (bilingual evaluation understudy) scores or even human evaluation.
Cho说:“这个模型打败了你所能想到的所有单一配对模型。”无论是用BLEU评分(译者注:bilingual evaluation understudy,一种通过比较机器译文与参考译文的n元组重合度来自动评估翻译质量的常用指标),还是用人工评估来衡量,NMT的表现都与现有机器翻译模型持平,而且常常更好。
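For readers unfamiliar with BLEU, the following is a simplified, single-reference, sentence-level sketch of the idea: clipped n-gram precision for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. It is an illustration under stated assumptions (whitespace tokenization, no smoothing, one reference), not the exact scoring setup used in the evaluation Cho reported, and the function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Simplified sentence-level BLEU against a single reference (no smoothing)."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, one empty precision zeroes the score
        log_precision_sum += math.log(clipped / total)
    # Brevity penalty: punish candidates shorter than the reference.
    penalty = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return penalty * math.exp(log_precision_sum / max_n)


print(bleu("the cat sat on the mat", "the cat sat on a mat"))
```

Real evaluations are normally computed at corpus level with an established tool rather than a hand-written function like this one.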
Cho also highlighted some other advantages to NMT aside from better quality, such as its robust handling of spelling mistakes and morphology. Another pleasant surprise was how the NMT system can translate into compound words that rarely appear in a training corpus the size of 100 million words.
除了翻译质量更好以外,Cho还强调了NMT的其他优势,比如它对拼写错误和词法的稳健处理能力。另一个意外之喜是,NMT能够译出在一亿词规模的训练语料库中也极少出现的复合词。
One breakthrough in particular was quite promising: the NMT system can translate into a desired target language even without knowing the source language.
尤其是有一项非常有前景的重大突破:NMT系统即使在不知道源语言的情况下也能翻译成指定的目标语。
Cho’s team trained their NMT system to translate from German, Czech, Finnish, and Russian to English. They then tasked the system to translate any given sentence into English without providing a language identifier.
Cho的团队训练他们的NMT系统将德语、捷克语、芬兰语和俄语翻译成英语,然后让系统在不提供语言标识的情况下,将任意给定的句子翻译成英语。
“The decoder doesn’t care which source language it was written in, it’s just going to translate into the target language,” Cho said. “Now, since our model is actually of the same size as before, we are saving around four times the parameters. Still, we get the same level or better performance.”
“解码器并不在意原文是用哪种源语言写的,它只管把内容翻译成目标语。”Cho说。“现在,由于我们的模型其实和以前一样大,所以我们节省了大约四倍的参数。尽管如此,它的性能仍与以前持平,甚至更好。”
They took the experiment a step further and fed the system a sentence written in three different languages. The system did the translation without any external indication which part of the sentence was written in which language, proving the model automatically learns how to handle code-switching within a sentence.
他们把实验又向前推进了一步,向系统输入了一个混用三种语言写成的句子。在没有任何外部标识说明句子的哪一部分是哪种语言的情况下,系统完成了翻译,这证明该模型能够自动学会处理句内的语码转换。
Finally, Cho touched on low resource languages. What his team and other NMT research teams across the globe have found is that as their system learns shared similarities across languages, it can actually apply learnings from high resource languages to low resource ones and improve their translation.
最后,Cho又谈到了低资源语言。他的团队以及世界各地的其他NMT研究团队发现,由于系统能够学习不同语言之间的共性,它其实可以把从高资源语言中学到的知识应用到低资源语言上,从而提升后者的翻译质量。
The Future is “Extremely Fast-Moving”
未来“变化很快”
Cho saved the cutting edge for last: non-parametric NMT. He says this system translates the way a human translator would: by leveraging translation memory (TM) as an on-the-fly training set.
Cho把最前沿的内容留到了最后:非参数NMT。他说,这个系统像人工译员那样翻译,即把翻译记忆(TM)用作动态训练集。
This way, the NMT system acts like a translator and does not need an entire training corpus in its database, but instead accesses relevant TMs to translate. Cho commented that this system actually displays higher consistency in style and vocabulary choice.
这样,NMT系统就像译员一样,无需将整个训练语料库存在数据库中,而是调用相关的翻译记忆进行翻译。Cho评价道,这个系统其实在风格和词汇选择上表现出更高的一致性。
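The article does not describe how relevant TM entries are found, so the following is only a sketch of the retrieval step under assumptions: it uses Python's standard `difflib.SequenceMatcher` as a stand-in fuzzy matcher, and the translation memory entries, the `retrieve` function, and its `k` and `threshold` parameters are hypothetical. In a system like the one Cho described, the retrieved pairs would then be passed to the NMT model as extra context, which is not shown here.

```python
from difflib import SequenceMatcher

# Hypothetical translation memory: (source segment, stored translation) pairs.
TM = [
    ("Press the power button to start the device.",
     "Drücken Sie die Einschalttaste, um das Gerät zu starten."),
    ("Press the reset button to restart the device.",
     "Drücken Sie die Reset-Taste, um das Gerät neu zu starten."),
    ("The warranty lasts two years.",
     "Die Garantie gilt zwei Jahre."),
]


def retrieve(source: str, memory: list[tuple[str, str]],
             k: int = 2, threshold: float = 0.5) -> list[tuple[float, str, str]]:
    """Return up to k TM entries whose source side is most similar to `source`.

    Each result is (similarity, tm_source, tm_target), best match first.
    Entries scoring below `threshold` are dropped.
    """
    scored = [
        (SequenceMatcher(None, source, tm_source).ratio(), tm_source, tm_target)
        for tm_source, tm_target in memory
    ]
    relevant = [entry for entry in scored if entry[0] >= threshold]
    relevant.sort(key=lambda entry: entry[0], reverse=True)
    return relevant[:k]


for score, tm_source, tm_target in retrieve("Press the power button to restart the device.", TM):
    print(f"{score:.2f}  {tm_source}  ->  {tm_target}")
```

Because only the closest segments are fetched at translation time, terminology and phrasing from earlier translations are available to the system, which is consistent with the higher consistency in style and vocabulary that Cho mentioned.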
Finally, Cho closed his presentation on state-of-the-art NMT by explaining the future direction of NMT research.
Cho在其关于先进的NMT技术演讲结束时,阐明了NMT研究的未来方向。
First, low resource language translation is a priority. Second, he said there is already some body of work on zero resource translation. The third and last direction is better handling of Chinese, Japanese, and Korean translation.
第一,低资源语言翻译是当务之急;第二,他说,零资源翻译方面已经有了一些研究成果;第三,也是最后一个方向,是更好地处理中文、日语和韩语的翻译。
Later on in the panel session, Cho fielded a question about the biggest challenge in NMT.
在随后的小组讨论环节,Cho回答了一个关于NMT最大挑战的问题。
He said hundreds of people have been working on MT for over 30 years, and research on NMT has been going on for about three years. “It’s only the apparent disruption you see,” Cho said, explaining that it will be hard to tell what kind of disruption will result from incremental advances in research.
他指出,30多年来,成百上千的人一直在致力于机器翻译的研究,而对NMT的研究才进行了约三年。“人们看到的只是表面上的颠覆。”Cho解释道,很难判断研究中的渐进式进展最终会带来怎样的颠覆。
“Even if I can tell you the challenges that I’m working on at the moment, that probably won’t tell you or anybody how the next disruption is going to happen,” he said.
他说:“即使我告诉你目前我正在应对的挑战是什么,或许也没人能从中知道接下来的颠覆将以怎样的方式到来。”
Pondering how fast these breakthroughs make it to market, May Habib, CEO, Qordoba, asked after the presentation how long it takes between research breakthrough and deployment in the field.
Qordoba公司首席执行官May Habib想知道这些突破多快能够进入市场,于是在Cho演讲结束后问道,从研究取得突破到实际部署需要多长时间。
Cho pointed out that they published their first paper on NMT in 2015, and the first big commercial announcement regarding application was from Google Translate in September 2016. He added that though Google did not disclose details of their deployment, Facebook still managed to launch their own NMT system a year later.
Cho指出他们于2015年发表了首篇关于NMT的论文,2016年9月,谷歌翻译首次重磅宣布商用。他补充道,虽然谷歌没有披露实际应用的详情,但Facebook在一年后发布了自己的NMT系统。
“It’s an extremely fast-moving field in that every time there is some change, we see the improvement,” Cho said. “So you gotta stay alert.”
Cho说:“这是个发展速度极快的领域,每次有所改变,我们都能看到进步,所以得时刻保持警觉。”
Glossary:
neural machine translation (NMT) 神经网络机器翻译
the European Patent Office 欧洲专利局
statistical model 统计模型
sub-word level 子词级
character-level 字符级
morphology 词法
single paired model 单一配对模型
compound words 复合词
source language 源语言
target language 目标语;目的语
decoder 解码器;译码器;译码员
parameter 参数;系数;参量
code-switching 语码转换
low resource languages 低资源语言
high resource languages 高资源语言
zero resource translation 零资源翻译
on-the-fly [计] (计算机)运行中的;动态的
training set [计] 训练集
translation memory (TM) 翻译记忆
corpus [计] 语料库
Compiled and translated by: Yee君